This report explores the ‘Red Wines’ dataset, which is described as containing 1599 observations of 11 variables that measure chemical properties, plus a quality rating score that is the variable of interest.

The main focus of the exploration will be the question: “What (chemical property) variables influence the quality score of red wines?” We will start with univariate investigation of the 12 variables, then proceed to bivariate and multivariate explorations of quality v. the other 11 variables. Finally, we will fit a linear model to predict the quality based on the other factors.

First, load the data and confirm the dimensions of the dataset and the variables.

## [1] 1599   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

Univariate Plots Section

There are 13 total columns in the dataset. Check further on the variable names and structure:

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Of these 13 variables, 11 record the objective chemical properties, 1 (“X”) is an ID variable, and 1 (“quality”) records the subjective quality score.
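The split into ID, target, and predictor columns can be sketched directly from the column names printed above. This is a minimal illustration; the names `id_col`, `target_col`, and `predictors` are our own and do not come from the original analysis code.

```r
# Column names as printed by names() in the report
cols <- c("X", "fixed.acidity", "volatile.acidity", "citric.acid",
          "residual.sugar", "chlorides", "free.sulfur.dioxide",
          "total.sulfur.dioxide", "density", "pH", "sulphates",
          "alcohol", "quality")

id_col     <- "X"        # row identifier, not a measurement
target_col <- "quality"  # subjective score to be predicted

# Everything else is a chemical property (candidate predictor)
predictors <- setdiff(cols, c(id_col, target_col))
length(predictors)
```

The length of `predictors` is 11, matching the count of chemical property variables.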

Since “quality” is the value we are interested in predicting, check the distribution of it first in a histogram, then check the summary (min, max, mean, median, and quartile values):

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

The “quality” score is of type int and takes only discrete integer values. Its distribution is roughly bell-shaped (approximately normal), with a median of 6.000 and a mean of 5.636. Although the grading scale runs from 0 to 10, the scores in this dataset range only from 3 to 8. Both the median and the mean fall within the 5-6 range.

We will create a new factor variable “quality.factor” from the “quality” score which will help with some later boxplots.

We will also create the ordered factor variable “quality.rating” with the “quality” score divided into 3 descriptive levels: “low” (0 - 4), “average” (5 - 6), and “high” (7 - 10).
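One possible way to build these two variables is sketched below. The “quality” vector is rebuilt from the frequency table shown earlier, so the example runs without the data file; the use of `factor()` and `cut()` is an assumption about how such variables could be derived, not necessarily the code behind this report.

```r
# Rebuild the quality column from the frequency table in the report
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Factor with one level per observed score (3 to 8)
quality.factor <- factor(quality)

# Ordered 3-level rating: low (0-4), average (5-6), high (7-10)
quality.rating <- cut(quality,
                      breaks = c(-Inf, 4, 6, Inf),
                      labels = c("low", "average", "high"),
                      ordered_result = TRUE)

table(quality.rating)
```

With the default right-closed intervals, scores 3-4 fall in “low”, 5-6 in “average”, and 7-8 in “high”, giving counts of 63, 1319, and 217.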

Since there are only 11 independent variables (from 13 total excluding “quality” and “X”), we can easily individually check the distributions and summaries for each.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

“fixed.acidity” distribution is right-skewed with a median of 7.9 and mean of 8.32

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

“volatile.acidity” distribution is mostly normal with a median of 0.52 and mean of 0.5278, and some outliers on the right.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

“citric.acid” is slightly right-skewed with median of 0.260 and mean of 0.271

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

“residual.sugar” is right-skewed with median of 2.200 and mean of 2.539. There are many outliers on the right, including a high max of 15.500, so we also include a plot limited to values up to the 95th percentile. Even without these outliers, the distribution still looks right-skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

“chlorides” would appear normal except for the values on the right that make it right-skewed, with median of 0.07900 and mean of 0.08747. The max of 0.61100 is very high compared with the majority of values. Because of the numerous outliers, we remove the more extreme ones by plotting only values up to the 95th percentile. Without these outliers, the distribution looks normal.
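The percentile-limited view can be sketched as follows. The data here is synthetic (a log-normal sample standing in for a right-skewed column such as “chlorides”), and the variable names are illustrative only.

```r
set.seed(42)
# Synthetic right-skewed values standing in for a column such as chlorides
x <- rlnorm(1000, meanlog = -2.5, sdlog = 0.4)

# Upper cut-off at the 95th percentile
upper <- quantile(x, probs = 0.95)

# Keep only values at or below the cut-off for the zoomed-in view
x_trimmed <- x[x <= upper]

# Bin counts for the trimmed view (plot = FALSE skips drawing)
h <- hist(x_trimmed, breaks = 30, plot = FALSE)

c(full_max = max(x), trimmed_max = max(x_trimmed))
```

Swapping `probs = 0.95` for `probs = 0.99` gives the gentler cut used for the variables with fewer extreme values.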

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

“free.sulfur.dioxide” is right-skewed, with median of 14.00 and mean of 15.87. There are some outliers including the max of 72.00, so we add a plot limited to values up to the 99th percentile and see that the distribution does not change much.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

“total.sulfur.dioxide” is right-skewed with median of 38.00 and mean of 46.47. There are some outliers including the max of 289.00, so we add a plot limited to values up to the 99th percentile to zoom in on the distribution, which still looks right-skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

“density” is normally distributed with median of 0.9968 and mean of 0.9967

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

“pH” is also normally distributed, with median of 3.310 and mean of 3.311

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

“sulphates” is right-skewed with median of 0.6200 and mean of 0.6581. There are some outliers, including the max of 2.000 on the far right. Because of these outliers, we also create a plot limited to values up to the 99th percentile (higher than the 95th to avoid cutting off some values on the plot) to zoom in on the central portion, which remains right-skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

“alcohol” is also right-skewed, with median of 10.20 and mean of 10.42
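As a rough numeric companion to the histograms, the medians and means quoted above can be compared directly. A mean noticeably above the median is only a crude hint of a right tail, not a formal skewness test; the `rel_gap` column is our own illustrative measure.

```r
# Median and mean for each variable, copied from the summaries above
stats <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH", "sulphates",
               "alcohol"),
  median = c(7.90, 0.52, 0.260, 2.200, 0.07900, 14.00, 38.00,
             0.9968, 3.310, 0.6200, 10.20),
  mean   = c(8.32, 0.5278, 0.271, 2.539, 0.08747, 15.87, 46.47,
             0.9967, 3.311, 0.6581, 10.42)
)

# Relative gap between mean and median; a positive gap hints at a right tail
stats$rel_gap <- (stats$mean - stats$median) / stats$median

stats[order(-stats$rel_gap), c("variable", "rel_gap")]
```

The gap is largest for “total.sulfur.dioxide” and essentially zero for “density” and “pH”, which is in line with the shapes described above.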


Univariate Analysis

What is the structure of your dataset?

There are 1599 observations of red wines, and 13 total variables. Of these, only 12 are relevant, as the variable “X” is an identifier. The variable “quality” is the dependent variable of interest, since we are interested in predicting “quality” using the other 11 variables which measure chemical properties of the red wines.

  • All 11 chemical property variables are numeric; there are no categorical variables.
  • There are no missing / NA values in the dataset.
  • The distributions are either approximately normal (“quality”, “volatile.acidity”, “chlorides” once its outliers are excluded, “density”, “pH”) or right-skewed (“fixed.acidity”, “citric.acid”, “residual.sugar”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “sulphates”, “alcohol”).

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the red wines dataset is the “quality” score. We are interested in seeing which of the other 11 features are related to the “quality”, and also in creating a predictive model for “quality” based on the relevant variables out of the 11.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Since we do not know much about wines or alcohol, we treat each of the 11 chemical property variables as equally likely to influence the “quality” score. We will find out more in the bivariate investigation section.

Did you create any new variables from existing variables in the dataset?

The new categorical variable “quality.factor” was created from the integer variable “quality”. This is justified since “quality” only takes discrete integer values.

The new categorical variable “quality.rating” was also created from “quality”, but this time the score is divided into ratings of “low” (0 - 4), “average” (5 - 6), and “high” (7 - 10).

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

This was a relatively tidy dataset with no missing values, and no cleaning operations were necessary. The only adjustments were:

  • adding the “quality.factor” and “quality.rating” categorical variables to use in later boxplots
  • showing additional plots limited to the 95th percentile for “residual.sugar” and “chlorides”, and to the 99th percentile for “free.sulfur.dioxide”, “total.sulfur.dioxide”, and “sulphates”, to get a better look at the distributions.

Bivariate Plots Section

Since we are interested in predicting the output variable “quality”, most of this section plots the 11 chemical property variables against “quality”. Before doing that, we first plot a scatterplot matrix of all the variables (13 in total, excluding “X”) against each other to see if there are any other interesting relationships:

Since the original 12 variables are numeric, ggpairs shows scatterplots for them in the matrix, and boxplots against the newly created factor variable “quality.factor”.

From the correlation matrix, we can observe the pairs of variables that are highly correlated:
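A minimal sketch of how such pairs could be listed programmatically is given below. It uses only the first 10 rows of four columns (copied from the str() output earlier), so the coefficients will differ from the full-data values quoted in this report; the 0.5 cut-off is an illustrative choice of ours.

```r
# First 10 rows of four columns, copied from the str() output above
wine10 <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Pearson correlation matrix
r <- cor(wine10)

# Keep each pair once (upper triangle) and list those with |r| >= cutoff
cutoff <- 0.5   # illustrative threshold, not taken from the report
idx <- which(upper.tri(r) & abs(r) >= cutoff, arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = round(r[idx], 3)
)
strong_pairs
```

Restricting to the upper triangle avoids reporting each pair twice and drops the trivial diagonal entries.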

Now, let’s plot the 11 chemical property variables against the “quality” score, and the “quality.rating” factor.

“fixed.acidity” - a weak positive trend (0.124) as “quality” increases, more obvious on the “quality.rating” plot.

“volatile.acidity” - a negative trend (-0.391) as “quality” / “quality.rating” increases. This fits the strong negative correlation coefficient from the earlier matrix.

“citric.acid” - a positive trend (0.226) against both the “quality” and “quality.rating” measures. However, “citric.acid” is strongly correlated with the previous variable, “volatile.acidity” (coefficient of -0.552).

“residual.sugar” - many outliers at high values, which make the plot harder to read. No real positive or negative trend can be seen (0.014). We create an additional view limited to the 95th percentile and the trend still appears level.

“chlorides” - again, many high-value outliers make the plot hard to read, but from the interquartile ranges of the boxplot we can see a weak negative trend (-0.129). The 95th percentile plot confirms a slight negative trend.

“free.sulfur.dioxide” - wines with an average “quality.rating” have higher “free.sulfur.dioxide” values than wines with either low or high ratings. Since the relationship isn’t linear, we don’t expect a high correlation coefficient either (-0.051). The 99th percentile plot zooms in on the nonlinear relationship.

“total.sulfur.dioxide” - same as the plot for “free.sulfur.dioxide”, which makes sense as “total.sulfur.dioxide” is strongly correlated with it (0.668). No linear relation to “quality” and “quality.rating” (though the correlation coefficient is -0.185, likely due to the decreasing mean when moving from “quality” score 5 to 6). Again, the 99th percentile plot shows a nonlinear relationship.

“density” - a slight negative trend (-0.175)

“pH” - also a slight negative trend, though the correlation coefficient is low (-0.058). But we know that “pH” is correlated with “fixed.acidity” (-0.683), “volatile.acidity” (0.235), and “citric.acid” (-0.542) which all have slight linear trends with “quality” / “quality.rating”

“sulphates” - a positive linear trend (0.251), which is made more obvious on the 99th percentile plots.

“alcohol” - a rather strong positive linear trend (0.476). The “alcohol” measure remains relatively level from score 3 to score 5, then increases as the “quality” increases from 6 to 8.

Let’s summarise the 11 chemical properties across the 3 subjective quality ratings (“quality.rating”) in a combined boxplot. We remove the extreme outliers by limiting the y-axis to the 95th or 99th percentile for the variables with outliers that we examined in the previous part.

“residual.sugar” and “chlorides” were limited to the 95th percentile.

“free.sulfur.dioxide”, “total.sulfur.dioxide”, and “sulphates” were limited to the 99th percentile.

By plotting these next to each other, we can see which trends are more obvious, and which are not.


Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

From the scatterplots with boxplot overlays and the correlation analysis, we can make some generalisations about what factors are more likely to affect the “quality” measure:

  • factors that appear to affect “quality” (strong trend) :

    • volatile.acidity (-0.391)
    • citric.acid (0.226)
    • sulphates (0.251)
    • alcohol (0.476, highest correlation factor)
  • factors that might affect “quality” (slight trend) :

    • fixed.acidity (0.124)
    • chlorides (-0.129)
    • total.sulfur.dioxide (-0.185)
    • density (-0.175)
    • pH (even though the correlation coefficient is low at -0.058)
  • factors that do not appear to affect “quality” (no trend or nonlinear trend) :

    • residual.sugar (0.014)
    • free.sulfur.dioxide (-0.051)

Often, it was easier to see a relationship on the “quality.rating” plots which only have 3 levels compared to the “quality” plots with 6 levels

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There are some chemical property variables which are strongly correlated with each other, which makes sense since they measure properties which are closely related:

  • “free.sulfur.dioxide” and “total.sulfur.dioxide”
  • “fixed.acidity”, “volatile.acidity”, “citric.acid” and “pH”

Some strong relationships probably have a chemical explanation:

  • “fixed.acidity”, “volatile.acidity”, “citric.acid”, “pH” and “density”
  • “citric.acid” and “sulphates”
  • “chlorides” and “sulphates”

What was the strongest relationship you found?

These were the strongest correlations among all the variables:

  • pH : citric.acid (-0.542) - the negative correlation is not surprising since pH is related to acidity (lower pH = more acidity)
  • pH : fixed.acidity (-0.683) - same as above
  • density : fixed.acidity (0.668) - there is probably a chemical explanation for this, for example that the acids are denser than water
  • total.sulfur.dioxide : free.sulfur.dioxide (0.668) - not surprising for the strong positive correlation, both are sulfur measures
  • citric.acid : volatile.acidity (-0.552) - not surprising that the acid level is related to an acidity measure, but surprising that the coefficient is negative
  • citric.acid : fixed.acidity (0.672) - the coefficient sign is positive, surprising since it is negative for “volatile.acidity”

The strongest correlations compared to “quality” are:

  • quality : alcohol (0.476)
  • quality : sulphates (0.251)
  • quality : citric.acid (0.226)
  • quality : volatile.acidity (-0.391)

Multivariate Plots Section

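From the bivariate analysis section, we see that the strongest contenders for influencing the “quality” score are, in order of absolute correlation: “alcohol” (0.476), “volatile.acidity” (-0.391), “sulphates” (0.251), and “citric.acid” (0.226).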

Since “alcohol” has the strongest relationship with “quality”, let’s plot the other variables here against “alcohol” to see if they have any influence on the “quality.rating”, holding “alcohol” constant:

“alcohol” v “sulphates” : Sulphate values were limited to the 99th percentile (from earlier exploration). Holding the alcohol level constant, higher sulphates values seem to lead to higher quality ratings.

“alcohol” v “citric.acid” : Not as clear as for the last variable due to the presence of outliers, but generally higher citric acid value will have higher quality ratings.

“alcohol” v “volatile.acidity” : Lower volatile acidity is associated with higher quality ratings. Most of the low quality rating red wines have a high volatile acidity value.

“citric.acid” v “volatile.acidity” : During the bivariate analysis, we saw that these 2 variables had a strong negative correlation (-0.552). So we plot them together here and see that the negative trend is visible, along with confirming the finding from the previous plot (“alcohol” v “volatile.acidity”) that lower volatile acidity has higher quality ratings.

Now let’s take a brief look at the factors which might affect the “quality” score, but which had a weaker trend than the previous factors:

“alcohol” v “fixed.acidity” : Almost no correlation (-0.062). But we see that, at a given alcohol level, higher fixed acidity values usually have higher quality ratings. This might be related to the negative relationship that alcohol has with volatile acidity that we saw earlier: since “fixed.acidity” and “volatile.acidity” have a negative correlation (-0.256), we would expect a positive relationship between “alcohol” and “fixed.acidity”.

“alcohol” v “chlorides” : We see in the plot a weak negative correlation (-0.221). There were many high-value outliers for “chlorides”, so we limited the plot to the 95th percentile.

“alcohol” v “total.sulfur.dioxide” : Weak negative correlation (-0.206), not really visible in the plot due to the dispersal and the high quality line being in the middle. There were many high-value outliers for “total.sulfur.dioxide”, so we only plotted values up to the 99th percentile.

“alcohol” v “density” : A strong negative correlation (-0.496). In the plot we see that with constant alcohol value, lower density is slightly associated with lower quality rating.

“alcohol” v “pH” : A weak correlation (0.206), and we see that for the same alcohol value, higher pH wines generally have lower quality ratings.

We can also check several groups of variables that have some relationship with each other, that we uncovered from the correlation plot in the bivariate analysis section:

“fixed.acidity”, “volatile.acidity”, “citric.acid”, “pH”, “density” : These are the acid / acidity / pH measures, plus density. We can see that there is some trend (either positive or negative) among these pairs, except for “volatile.acidity” v “density” which is relatively level. From the overlapping histograms, we can see that higher “quality.rating” wines have lower “volatile.acidity” and higher “citric.acid” values.

“citric.acid”, “chlorides”, “sulphates” : Positive correlation among these 3 variables. The histogram shows that higher “quality.rating” is associated with higher “citric.acid”, lower “chlorides” and higher “sulphates”.

“free.sulfur.dioxide”, “total.sulfur.dioxide”, “residual.sugar” : Positive correlation among all 3 variables, but the histograms don’t really show any relationship of the “quality.rating” score with these variables.

Linear Regression Model

Let’s create a linear regression model to predict the “quality” score using all the chemical property variables (full model):

## 
## Call:
## lm(formula = quality ~ ., data = red_subset2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68911 -0.36652 -0.04699  0.45202  2.02498 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.197e+01  2.119e+01   1.036   0.3002    
## fixed.acidity         2.499e-02  2.595e-02   0.963   0.3357    
## volatile.acidity     -1.084e+00  1.211e-01  -8.948  < 2e-16 ***
## citric.acid          -1.826e-01  1.472e-01  -1.240   0.2150    
## residual.sugar        1.633e-02  1.500e-02   1.089   0.2765    
## chlorides            -1.874e+00  4.193e-01  -4.470 8.37e-06 ***
## free.sulfur.dioxide   4.361e-03  2.171e-03   2.009   0.0447 *  
## total.sulfur.dioxide -3.265e-03  7.287e-04  -4.480 8.00e-06 ***
## density              -1.788e+01  2.163e+01  -0.827   0.4086    
## pH                   -4.137e-01  1.916e-01  -2.159   0.0310 *  
## sulphates             9.163e-01  1.143e-01   8.014 2.13e-15 ***
## alcohol               2.762e-01  2.648e-02  10.429  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.648 on 1587 degrees of freedom
## Multiple R-squared:  0.3606, Adjusted R-squared:  0.3561 
## F-statistic: 81.35 on 11 and 1587 DF,  p-value: < 2.2e-16
##        fixed.acidity     volatile.acidity          citric.acid 
##             7.767512             1.789390             3.128022 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##             1.702588             1.481932             1.963019 
## total.sulfur.dioxide              density                   pH 
##             2.186813             6.343760             3.329732 
##            sulphates              alcohol 
##             1.429434             3.031160

In the summary, we can see which variables are significant to the model:

  • volatile.acidity ***
  • chlorides ***
  • free.sulfur.dioxide *
  • total.sulfur.dioxide ***
  • pH *
  • sulphates ***
  • alcohol ***

Though there are many significant variables, the adjusted R-squared is relatively low at 0.3561, which means that it is not a great predictor for variations in the output variable.

Several variables have high VIF values, which means that there is multicollinearity among the variables of the full model.
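For reference, the variance inflation factor of a predictor j is 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing predictor j on all the other predictors. The sketch below computes it by hand in base R on synthetic data (the variables `x1`, `x2`, `x3` are made up for illustration); it is not the function call used to produce the VIF output above.

```r
set.seed(1)
n  <- 500
# Synthetic predictors: x2 is built to be strongly related to x1
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the other predictors
vif_manual <- sapply(names(X), function(v) {
  fit <- lm(reformulate(setdiff(names(X), v), response = v), data = X)
  1 / (1 - summary(fit)$r.squared)
})
round(vif_manual, 2)
```

The two related predictors get large VIF values, while the independent one stays close to 1, the minimum possible value.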

For the next model iteration, we will decrease the number of predictor variables by only including the variables that were marked significant in the full model’s summary.
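This selection step could be automated along the following lines. The sketch uses synthetic data with made-up variables (`x1`, `x2`, `x3`, `y`) and a 0.05 threshold; it illustrates the idea of refitting on the flagged terms, and is not the code used for the models in this report.

```r
set.seed(2)
n <- 500
# Synthetic data: y depends on x1 and x2 only; x3 is unrelated noise
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 2 * d$x1 - 3 * d$x2 + rnorm(n)

full <- lm(y ~ ., data = d)

# Coefficient table: rows are terms, the "Pr(>|t|)" column holds the p-values
ctab  <- summary(full)$coefficients
pvals <- ctab[-1, "Pr(>|t|)"]          # drop the intercept row

# Keep terms flagged at the 0.05 level and refit
keep    <- names(pvals)[pvals < 0.05]
reduced <- lm(reformulate(keep, response = "y"), data = d)
formula(reduced)
```

Selecting terms by p-value is a simple heuristic; with correlated predictors the set of flagged terms can change once others are dropped, so the reduced model’s summary is worth re-checking.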

## 
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + total.sulfur.dioxide + 
##     free.sulfur.dioxide + pH + sulphates + alcohol, data = red_subset2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68918 -0.36757 -0.04653  0.46081  2.02954 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.4300987  0.4029168  10.995  < 2e-16 ***
## volatile.acidity     -1.0127527  0.1008429 -10.043  < 2e-16 ***
## chlorides            -2.0178138  0.3975417  -5.076 4.31e-07 ***
## total.sulfur.dioxide -0.0034822  0.0006868  -5.070 4.43e-07 ***
## free.sulfur.dioxide   0.0050774  0.0021255   2.389    0.017 *  
## pH                   -0.4826614  0.1175581  -4.106 4.23e-05 ***
## sulphates             0.8826651  0.1099084   8.031 1.86e-15 ***
## alcohol               0.2893028  0.0167958  17.225  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6477 on 1591 degrees of freedom
## Multiple R-squared:  0.3595, Adjusted R-squared:  0.3567 
## F-statistic: 127.6 on 7 and 1591 DF,  p-value: < 2.2e-16
##     volatile.acidity            chlorides total.sulfur.dioxide 
##             1.241819             1.333333             1.943920 
##  free.sulfur.dioxide                   pH            sulphates 
##             1.882706             1.254570             1.321931 
##              alcohol 
##             1.220157

This second model has p-values close to 0 for all the included predictor variables except “free.sulfur.dioxide”, which is significant only at the * level (p = 0.017, i.e. <= 0.05). Adjusted R-squared has increased slightly, to 0.3567.

The VIF values are now lower (< 2) for all the variables in the model.

Let’s try removing “free.sulfur.dioxide” in the next model since we already included “total.sulfur.dioxide” (which is highly correlated with “free.sulfur.dioxide”).

## 
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + total.sulfur.dioxide + 
##     pH + sulphates + alcohol, data = red_subset2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.60575 -0.35883 -0.04806  0.46079  1.95643 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.2957316  0.3995603  10.751  < 2e-16 ***
## volatile.acidity     -1.0381945  0.1004270 -10.338  < 2e-16 ***
## chlorides            -2.0022839  0.3980757  -5.030 5.46e-07 ***
## total.sulfur.dioxide -0.0023721  0.0005064  -4.684 3.05e-06 ***
## pH                   -0.4351830  0.1160368  -3.750 0.000183 ***
## sulphates             0.8886802  0.1100419   8.076 1.31e-15 ***
## alcohol               0.2906738  0.0168108  17.291  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6487 on 1592 degrees of freedom
## Multiple R-squared:  0.3572, Adjusted R-squared:  0.3548 
## F-statistic: 147.4 on 6 and 1592 DF,  p-value: < 2.2e-16
##     volatile.acidity            chlorides total.sulfur.dioxide 
##             1.227967             1.332977             1.053830 
##                   pH            sulphates              alcohol 
##             1.218707             1.321237             1.218733

In this third model, all 6 predictors have p-values close to 0. However, the adjusted R-squared has decreased slightly to 0.3548. So we know that the 6 predictors are very likely to be related to the quality score, but the explanatory power of our model is still not good. There are likely to be variables we are missing that are needed in order to create a better model with more explanatory power.

The VIF values are lower still, now all < 1.5 which means a low level of multicollinearity.
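The adjusted R-squared values of the three models can be cross-checked from the printed Multiple R-squared values using the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The helper `adj_r2` below is our own; the inputs are copied from the model summaries.

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 1599
models <- data.frame(
  model    = c("model_1", "model_2", "model_3"),
  r2       = c(0.3606, 0.3595, 0.3572),  # Multiple R-squared from the summaries
  p        = c(11, 7, 6),                # number of predictors
  reported = c(0.3561, 0.3567, 0.3548)   # Adjusted R-squared from the summaries
)
models$recomputed <- round(adj_r2(models$r2, n, models$p), 4)
models
```

The recomputed values agree with the reported ones to within rounding of the four-decimal R-squared inputs. The penalty term (n - 1)/(n - p - 1) is why dropping four weak predictors from the full model raises the adjusted value slightly even though the multiple R-squared falls.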

## [1] "summary : actual quality values"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## [1] "summary : predicted quality values from model 2"
## Length  Class   Mode 
##      0   NULL   NULL
## [1] "quality rating distribution"
##     low average    high 
##      63    1319     217

This is a plot of the error (predicted - actual) values from model 2, which had the highest adjusted R-squared. The plot and the summary tables show that the model overpredicts low scores and underpredicts high scores (min 4.304 in prediction versus min 3 in actual, max 7.342 in prediction versus max 8 in actual). This clustering around the middle/average region of 5-6 scores is likely due to the concentration of our data points in this range: 1319 out of 1599 (82.49%) of our data points are in the middle/average range.
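The pattern of overpredicting low scores and underpredicting high scores can be reproduced on synthetic data. In the sketch below, the data frame `d`, the single predictor `x`, and the simulated `score` are all made up; only the error definition (predicted - actual) follows the report.

```r
set.seed(3)
n <- 1000
# Synthetic stand-in: an integer score driven by one predictor plus noise
x     <- rnorm(n)
score <- pmin(pmax(round(5.6 + 0.5 * x + rnorm(n, sd = 0.65)), 3), 8)
d     <- data.frame(x = x, score = score)

fit  <- lm(score ~ x, data = d)
pred <- predict(fit, newdata = d)   # numeric vector of fitted scores
err  <- pred - d$score              # error = predicted - actual

summary(pred)                          # compare with the range of d$score
round(tapply(err, d$score, mean), 2)   # mean error by actual score
```

For any least-squares fit with an intercept and R-squared below 1, the fitted values vary less than the actual values, so the in-sample error is negatively correlated with the actual score. Part of the compression seen in model 2 is therefore expected from the method itself, in addition to the scarcity of low and high scores.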

Now let’s look at the density plots of each of these variables from model_2 to see if the trends are noticeable and match with the coefficients from the linear model:

We multiplot some density plots for the variables in the linear model. Even though these are bivariate plots, the variables were decided as a result of running the linear regression models.

We can see the direction and strength of trends - for example, the positive / negative effects of “volatile.acidity” and “sulphates” are more obvious than for “total.sulfur.dioxide” and “free.sulfur.dioxide” whose plots don’t really show any trend.


Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

From the bivariate analysis, we saw that “alcohol” has the highest correlation with the “quality” score. So we explored the other variables compared to “alcohol”, and splitting or faceting by “quality.rating”. We saw some noticeable trends associated with the higher ratings:

  • higher sulphates
  • higher citric.acid
  • lower volatile.acidity
  • higher fixed.acidity
  • lower chlorides
  • higher density
  • lower pH

Very small effects from:

  • total.sulfur.dioxide

Were there any interesting or surprising interactions between features?

  • “fixed.acidity” has low correlation with “quality” (0.124), and almost no correlation with “alcohol” (-0.062), but when we plot it against “alcohol” we can see a positive trend: with constant alcohol value, higher fixed acidity values usually have higher quality ratings. This is probably due to the negative relationship that “alcohol” has with “volatile.acidity” (-0.202) and the negative correlation between “fixed.acidity” and “volatile.acidity” (-0.256).
  • “density” has negative correlation with “quality” (-0.175), but when plotted against alcohol, higher density values have higher quality ratings. This is likely due to the strong negative correlation between “alcohol” and “density” (-0.496).
  • “total.sulfur.dioxide” and “free.sulfur.dioxide” have low correlations with “quality” (-0.185, -0.051), but still appear in the linear model.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

There were 3 linear models created - a full model (model_1), a model removing the non-significant factors from the full model (model_2), and a model removing an additional factor “free.sulfur.dioxide” (model_3).

The model with the highest adjusted R-squared (meaning explanatory power) was model_2, though this value was low at 0.3567. VIF analysis showed that the 7 factors in this model did not have a high degree of multicollinearity, so we use these variables, which are all significant:

  • volatile.acidity
  • chlorides
  • total.sulfur.dioxide
  • free.sulfur.dioxide
  • pH
  • sulphates
  • alcohol

Model strength(s): The low p-values show that these chemical properties are very likely related to the quality score, which addresses a goal of the exploration (to see which variables influence the quality).

Model weakness(es): As a linear regression model, the output is a continuous number with decimals even though the quality scale is a discrete integer. The explanatory power is low (low adjusted R-squared). We have a lot of data (~82%) with average quality scores (5, 6), but not a lot of data at the low and high ends (<= 4, >= 7), which means our model is not very good at predicting low and high values, as seen in the error plot where it overestimates bad quality wines and underestimates good quality wines.


Final Plots and Summary

Plot One

Description One

This is a summary of the 11 chemical properties across the 3 subjective quality ratings (“quality.rating” - “low”, “average”, “high”) in a combined scatterplot and boxplot. The extreme outliers were removed by limiting the y-axis to the 95th percentile for “residual.sugar” and “chlorides”, and to the 99th percentile for “free.sulfur.dioxide”, “total.sulfur.dioxide”, and “sulphates”. We knew about the outliers for these 5 variables from the univariate exploration phase.

As we move across each individual plot from left to right, from quality rating “low” to “high”, we can see that some linear trends are more obvious, such as the negative trends for “volatile.acidity” and “pH”, and the positive trends for “citric.acid” and “alcohol”, while there are non-linear trends for “free.sulfur.dioxide” and “total.sulfur.dioxide”, and almost no trend for “residual.sugar”.

Because of the scatterplot, we can also see that most of our values are in the average “quality.rating” group, and that we have sparser data for low and high ratings.

Plot Two

Description Two

This is a density plot of the variables that were deemed significant by the linear model for “quality” (model 2 in the multivariate analysis section), coloured by the 3 “quality.rating” categories.

We can observe the distribution of these variables among the quality ratings and form some conclusions:

variable              trend direction  strength     coefficient from
                      (from plot)      (from plot)  linear model
volatile.acidity      negative         strong       -1.0127527
chlorides             negative         weak         -2.0178138
total.sulfur.dioxide  none                          -0.0034822
free.sulfur.dioxide   none                           0.0050774
pH                    negative         weak         -0.4826614
sulphates             positive         strong        0.8826651
alcohol               positive         strong        0.2893028

Wherever a trend direction was visible in these density plots, it agreed with the sign of the corresponding coefficient in the linear model, as expected. For example, the density plot of “volatile.acidity” shows a strong negative trend where higher quality wines have lower volatile acidity values. This negative trend matches the negative model coefficient.
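The comparison between plot trends and coefficient signs can be sketched numerically. The example below uses synthetic data, not the wine dataset, and assumes that “quality.rating” bins scores as low (< 5), average (5 to 6), and high (> 6); the actual binning is defined earlier in the report. It compares the shift in group medians from “low” to “high” against the sign of each fitted coefficient.

```r
# Synthetic stand-in for the wine data (hypothetical values, not the real dataset).
set.seed(42)
n <- 400
wine <- data.frame(
  volatile.acidity = runif(n, 0.1, 1.2),
  alcohol          = runif(n, 8, 14)
)
wine$quality <- round(6 - 1.0 * wine$volatile.acidity +
                        0.3 * (wine$alcohol - 11) + rnorm(n, sd = 0.7))

# Assumed binning: low (< 5), average (5-6), high (> 6).
wine$quality.rating <- cut(wine$quality, breaks = c(-Inf, 4, 6, Inf),
                           labels = c("low", "average", "high"))

predictors <- c("volatile.acidity", "alcohol")
fit <- lm(quality ~ volatile.acidity + alcohol, data = wine)

# Median of each predictor within each rating group.
medians <- aggregate(cbind(volatile.acidity, alcohol) ~ quality.rating,
                     data = wine, FUN = median)
low  <- unlist(medians[medians$quality.rating == "low",  predictors])
high <- unlist(medians[medians$quality.rating == "high", predictors])

# Trend direction from low to high versus the sign of the model coefficient.
comparison <- data.frame(
  variable    = predictors,
  trend       = sign(high - low),
  coefficient = unname(coef(fit)[predictors]),
  match       = sign(high - low) == sign(coef(fit)[predictors])
)
print(comparison, row.names = FALSE)
```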

Plot Three

Description Three

We plot the 6 chemical property variables from linear regression model 2 (excluding “alcohol”) against “alcohol” and “quality.rating” to see how their values influence the rating score while holding alcohol constant.

The 6 variables are:

  • chlorides
  • volatile.acidity
  • sulphates
  • pH
  • free.sulfur.dioxide
  • total.sulfur.dioxide

We also compare the trend directions relative to “alcohol” (the correlation with alcohol, and the trend direction in the plot against alcohol and quality rating) and relative to “quality” (the correlation with quality, and the coefficient from the linear model):

| variable | correlation w/ alcohol | trend compared to alcohol | correlation w/ quality | coefficient from linear model | match? |
|---|---|---|---|---|---|
| chlorides | -0.221 | negative | -0.129 | -2.0178138 | yes |
| volatile.acidity | -0.202 | negative | -0.391 | -1.0127527 | yes |
| sulphates | 0.094 | positive | 0.251 | 0.8826651 | yes |
| pH | 0.206 | negative | -0.058 | -0.4826614 | no |
| total.sulfur.dioxide | -0.206 | nonlinear | -0.185 | -0.0034822 | yes |
| free.sulfur.dioxide | -0.069 | nonlinear | -0.051 | 0.0050774 | no |
| alcohol | | | 0.476 | 0.2893028 | |

Points of note:

  • Generally, the signs of each variable's correlations with “alcohol” and “quality” match the direction of the trend when plotted against “alcohol” and “quality”, except for 2 variables: “pH” and “free.sulfur.dioxide”.
  • “pH” has a positive correlation with “alcohol”, but when plotted against both “alcohol” and “quality.rating”, we see that for a constant alcohol value, lower pH values are associated with higher quality ratings. This can be explained by the negative effect of “pH” on “quality” (even though the correlation with “quality” is small at -0.058, the linear model gives a rather large coefficient of -0.483).
  • “free.sulfur.dioxide” has a weak negative correlation with “alcohol” (-0.069) and a weak negative correlation with “quality” (-0.051), but ends up with a small positive coefficient (0.0051) in the linear model.
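One possible reason a variable can have a negative correlation with the response but a positive regression coefficient is that it is correlated with another predictor in the model. The sketch below shows this on synthetic, standardized data; the 0.7 correlation between the two sulfur dioxide variables and the effect sizes are assumptions chosen for illustration, and the report does not confirm that this is what happens in the wine data.

```r
# Synthetic, standardized data (hypothetical values, not the real dataset).
set.seed(123)
n <- 2000
total.sulfur.dioxide <- rnorm(n)
# Assumed: "free" is positively correlated (about 0.7) with "total".
free.sulfur.dioxide  <- 0.7 * total.sulfur.dioxide + sqrt(1 - 0.7^2) * rnorm(n)
# Assumed effects: "total" lowers quality, "free" raises it once "total" is held fixed.
quality <- -1.0 * total.sulfur.dioxide + 0.5 * free.sulfur.dioxide + rnorm(n)

# Marginal view: simple correlation between "free" and quality.
marginal_cor <- cor(free.sulfur.dioxide, quality)

# Partial view: coefficient of "free" with "total" also in the model.
fit <- lm(quality ~ total.sulfur.dioxide + free.sulfur.dioxide)
partial_coef <- unname(coef(fit)["free.sulfur.dioxide"])

cat("marginal correlation sign:", sign(marginal_cor), "\n")
cat("partial coefficient sign: ", sign(partial_coef), "\n")
```

The correlation summarizes the relationship with everything else left free to vary, while the coefficient describes it with the other predictors held fixed, so the two need not share a sign.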

Reflection

Where did I run into difficulties in the analysis?

The main difficulty was building a good predictive model from the existing data. This ultimately failed, since the explanatory power of the final model was low (model_2 had the highest adjusted R-squared at 0.3567). In addition, the linear regression model predicts a continuous value, while the “quality” score is a discrete integer.
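To make the two issues above concrete, the sketch below recomputes adjusted R-squared from its definition and maps continuous predictions back onto the integer quality scale by rounding and clamping. It runs on synthetic data with a hypothetical two-predictor model, not on model_2, and rounding is only one simple way to handle an integer response; it was not part of the original analysis.

```r
# Synthetic stand-in for the wine data (hypothetical values, not the real dataset).
set.seed(7)
n <- 300
wine <- data.frame(alcohol = runif(n, 8, 14), sulphates = runif(n, 0.3, 1.2))
wine$quality <- pmin(8, pmax(3, round(
  2 + 0.3 * wine$alcohol + 0.8 * wine$sulphates + rnorm(n, sd = 0.7))))

fit <- lm(quality ~ alcohol + sulphates, data = wine)

# Adjusted R-squared from its definition: 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# where p is the number of predictors (intercept excluded).
r2     <- summary(fit)$r.squared
p      <- length(coef(fit)) - 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Map continuous predictions onto the observed integer range of quality.
raw_pred <- predict(fit, newdata = wine)
int_pred <- pmin(max(wine$quality), pmax(min(wine$quality), round(raw_pred)))

# Share of wines whose rounded prediction equals the observed score.
accuracy <- mean(int_pred == wine$quality)
```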

Furthermore, since there was little data outside of the “average” quality score range (~88% of observations were “average” quality), it was hard to infer trends or make predictions for lower and higher quality cases.
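The class imbalance can be checked with a quick tabulation. The sketch below uses a synthetic score vector with made-up sampling probabilities and assumes the binning low (< 5), average (5 to 6), high (> 6), so the shares it prints are illustrative and do not reproduce the figure quoted above.

```r
# Synthetic quality scores (hypothetical probabilities, not the real dataset).
set.seed(3)
quality <- sample(3:8, 1000, replace = TRUE,
                  prob = c(0.01, 0.03, 0.43, 0.40, 0.12, 0.01))

# Assumed binning: low (< 5), average (5-6), high (> 6).
quality.rating <- cut(quality, breaks = c(-Inf, 4, 6, Inf),
                      labels = c("low", "average", "high"))

# Share of observations in each rating group, as percentages.
shares <- prop.table(table(quality.rating))
print(round(100 * shares, 1))
```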

Where did I find successes?

We were successful at least in identifying variables which are correlated with the output variable “quality” (from the correlation plots) and which likely influence the “quality” score (they have p-values close to 0 in the linear regression model). The scatterplots and geom_smooth lines were very helpful in showing patterns in the data.

However, all this is with the caveat that correlation does not equal causation, and that we are working with a dataset limited in observations (unequal distribution among the quality ratings) and characteristics (other objective or subjective properties that were not included).

What was surprising?

  • That the adjusted R-squared ended up being so low (0.3567), even though the variables were significant
  • Even though “total.sulfur.dioxide” and “free.sulfur.dioxide” have low correlations with “quality” (-0.185, -0.051), they are still included in the final model

How could the analysis be enriched in future work?

The analysis could be improved by obtaining additional information about the existing wines, and additional observations of lower-rated (< 5) and higher-rated (> 6) wines, even though this was not part of the provided dataset.

References

[1] dataset: https://docs.google.com/document/d/e/2PACX-1vRmVtjQrgEPfE3VoiOrdeZ7vLPO_p3KRdb_o-z6E_YJ65tDOiXkwsDpLFKI3lUxbD6UlYtQHXvwiZKx/pub?embedded=true
[2] variable description: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt
[3] missing value check: https://stackoverflow.com/questions/38924002/r-check-if-na-exists-in-any-column-of-r-dataframe-row-then-if-so-remove-that
[4] conditionally replace values: https://stackoverflow.com/questions/32578082/r-how-to-replace-value-of-a-variable-conditionally
[5] order factor variable: https://campus.datacamp.com/courses/introduction-to-r-for-finance/factors-4?ex=8#skiponboarding
[6] ggpairs column label wrapping: https://stackoverflow.com/questions/43256948/wrap-column-name-text-in-ggpairs-in-r
[7] ggpairs plot colours: https://stackoverflow.com/questions/44426674/improving-the-readability-of-the-scatterplot-in-ggpairs-ggplot
[8] correlation matrix: https://stackoverflow.com/questions/45873483/ggpairs-plot-with-heatmap-of-correlation-values
[9] ggcorr options: https://rdrr.io/cran/GGally/man/ggcorr.html
[10] dark theme: http://www.sthda.com/english/wiki/ggplot2-themes-and-background-colors-the-3-elements
[11] using color brewer in ggpairs: https://stackoverflow.com/questions/22237783/user-defined-colour-palette-in-r-and-ggpairs
[12] add line to ggpairs: https://stackoverflow.com/questions/35085261/how-to-use-loess-method-in-ggallyggpairs-using-wrap-function
[13] corrplot title: https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggally/ggcorr/
[14] grid.arrange title: https://stackoverflow.com/questions/14726078/changing-title-in-multiplot-ggplot2-using-grid-arrange
[15] vif analysis: http://www.sthda.com/english/articles/39-regression-model-diagnostics/160-multicollinearity-essentials-and-vif-in-r/
[16] hide legend: https://stackoverflow.com/questions/35618260/remove-legend-ggplot-2-2
[17] common legend for multiple plots: https://stackoverflow.com/questions/13649473/add-a-common-legend-for-combined-ggplots
[18] ggarrange title: https://rpkgs.datanovia.com/ggpubr/reference/annotate_figure.html
[19] density plots: http://www.sthda.com/english/wiki/ggplot2-density-plot-quick-start-guide-r-software-and-data-visualization
[20] plot themes: https://ggplot2.tidyverse.org/reference/theme.html